Recommender Systems

In this project, we build a movie recommender system. We read a dataset of movie ratings by users, then suggest other movies that a specific user might be interested in, based on their previous choices.


In [3]:
import numpy as np
import pandas as pd

Read the data


In [4]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [5]:
df.head()


Out[5]:
user_id item_id rating timestamp
0 0 50 5 881250949
1 0 172 5 881250949
2 0 133 1 881250949
3 196 242 3 881250949
4 186 302 3 891717742
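As a quick check of the loading step, the sketch below reads a tiny in-memory sample in the same tab-separated, header-less layout and runs a few sanity checks. The `sample` string and the name `sample_df` are assumptions made for illustration; the third row is made up and is not taken from `u.data`.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as u.data: tab-separated, no header row.
# The last row is invented for illustration.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

column_names = ['user_id', 'item_id', 'rating', 'timestamp']
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=column_names)

# Basic sanity checks: shape, missing values, and the range of ratings
print(sample_df.shape)
print(sample_df.isnull().sum())
print(sample_df['rating'].min(), sample_df['rating'].max())
```

Passing `names=` matters here because the file has no header line; without it, pandas would treat the first rating as the column names.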

Get movie titles


In [6]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()


Out[6]:
item_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

Merge the dataframes


In [7]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()


Out[7]:
user_id item_id rating timestamp title
0 0 50 5 881250949 Star Wars (1977)
1 290 50 5 880473582 Star Wars (1977)
2 79 50 4 891271545 Star Wars (1977)
3 2 50 5 888552084 Star Wars (1977)
4 8 50 5 879362124 Star Wars (1977)
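`pd.merge` performs an inner join by default, so any rating whose `item_id` has no matching title is silently dropped. The sketch below uses made-up toy data (the frames `toy_ratings` and `toy_titles` and the item id 999 are assumptions for illustration) to show one way to check for such unmatched rows.

```python
import pandas as pd

# Toy data, invented for illustration; item 999 has no entry in the titles table
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [50, 50, 999],
    'rating': [5, 4, 3],
})
toy_titles = pd.DataFrame({
    'item_id': [1, 50],
    'title': ['Toy Story (1995)', 'Star Wars (1977)'],
})

# Default inner join: only ratings with a matching title survive
inner = pd.merge(toy_ratings, toy_titles, on='item_id')

# Left join with indicator=True flags the ratings that found no title
left = pd.merge(toy_ratings, toy_titles, on='item_id', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']

print(len(inner), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.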

Exploratory Data Analysis


In [8]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

Create a ratings dataframe with average rating and number of ratings


In [9]:
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()


Out[9]:
title
Marlene Dietrich: Shadow and Light (1996)     5.0
Prefontaine (1997)                            5.0
Santa with Muscles (1996)                     5.0
Star Kid (1997)                               5.0
Someone Else's America (1995)                 5.0
Name: rating, dtype: float64

In [10]:
df.groupby('title')['rating'].count().sort_values(ascending=False).head()


Out[10]:
title
Star Wars (1977)             584
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64

In [11]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()


Out[11]:
rating
title
'Til There Was You (1997) 2.333333
1-900 (1994) 2.600000
101 Dalmatians (1996) 2.908257
12 Angry Men (1957) 4.344000
187 (1997) 3.024390

Add a number-of-ratings column


In [12]:
# Assign the per-title count directly; the Series aligns on the 'title' index
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()


Out[12]:
rating num of ratings
title
'Til There Was You (1997) 2.333333 9
1-900 (1994) 2.600000 5
101 Dalmatians (1996) 2.908257 109
12 Angry Men (1957) 4.344000 125
187 (1997) 3.024390 41
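The mean and the count can also be computed in a single `groupby` pass. This is a minimal sketch on made-up toy data (the titles `A`, `B`, `C` and the name `toy_ratings` are assumptions), not the notebook's own code.

```python
import pandas as pd

# Toy ratings, invented for illustration
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [5, 4, 3, 2, 4, 5],
})

# One pass over the groups computes both statistics
toy_ratings = toy.groupby('title')['rating'].agg(['mean', 'count'])

# Rename to match the column names used in this notebook
toy_ratings.columns = ['rating', 'num of ratings']
print(toy_ratings)
```

The result has the same shape as the `ratings` dataframe above: one row per title, with the average rating and the number of ratings.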

Data Visualization: Histogram


In [13]:
plt.figure(figsize=(10,4))
ratings['num of ratings'].hist(bins=70)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x29b68281518>

In [14]:
plt.figure(figsize=(10,4))
ratings['rating'].hist(bins=70)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x29b686316a0>

In [15]:
sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)


Out[15]:
<seaborn.axisgrid.JointGrid at 0x29b68668e48>

Recommending Similar Movies


In [16]:
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()


Out[16]:
title 'Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996) 12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996) 20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968) 3 Ninjas: High Noon At Mega Mountain (1998) 39 Steps, The (1935) ... Yankee Zulu (1994) Year of the Horse (1997) You So Crazy (1994) Young Frankenstein (1974) Young Guns (1988) Young Guns II (1990) Young Poisoner's Handbook, The (1995) Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 2.0 5.0 NaN NaN 3.0 4.0 NaN NaN ... NaN NaN NaN 5.0 3.0 NaN NaN NaN 4.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1664 columns
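The user-by-movie matrix is mostly NaN because each user rates only a small share of the movies. The sketch below, on made-up toy data (`toy`, `toy_mat`, and `sparsity` are assumed names), shows how `pivot_table` produces those NaN cells and how the fraction of missing cells could be measured.

```python
import pandas as pd

# Toy long-format ratings, invented for illustration
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 2],
})

# Rows are users, columns are titles; unrated combinations become NaN.
# pivot_table averages duplicates (default aggfunc is the mean).
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')

# Fraction of user/movie cells with no rating
sparsity = toy_mat.isnull().values.mean()
print(toy_mat)
print(sparsity)
```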

Most rated movies


In [17]:
ratings.sort_values('num of ratings',ascending=False).head(10)


Out[17]:
rating num of ratings
title
Star Wars (1977) 4.359589 584
Contact (1997) 3.803536 509
Fargo (1996) 4.155512 508
Return of the Jedi (1983) 4.007890 507
Liar Liar (1997) 3.156701 485
English Patient, The (1996) 3.656965 481
Scream (1996) 3.441423 478
Toy Story (1995) 3.878319 452
Air Force One (1997) 3.631090 431
Independence Day (ID4) (1996) 3.438228 429

We choose two movies: Star Wars, a sci-fi movie, and Liar Liar, a comedy.


In [18]:
ratings.head()


Out[18]:
rating num of ratings
title
'Til There Was You (1997) 2.333333 9
1-900 (1994) 2.600000 5
101 Dalmatians (1996) 2.908257 109
12 Angry Men (1957) 4.344000 125
187 (1997) 3.024390 41

Now let's grab the user ratings for those two movies:


In [19]:
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
starwars_user_ratings.head()


Out[19]:
user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Use the corrwith() method to compute the correlation between every movie column in the matrix and the ratings Series of the chosen movie:


In [20]:
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)


C:\Users\Luiz Henrique\AppData\Roaming\Python\Python35\site-packages\numpy\lib\function_base.py:2487: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
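To make the behaviour of `corrwith()` concrete, here is a minimal sketch on a made-up matrix (the columns `A`, `B`, `C` and the name `toy_mat` are assumptions for illustration). For each column, pandas computes the Pearson correlation over only the users who rated both movies; a column with no users in common yields NaN.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix, invented for illustration.
# Movie C was rated only by user 5, who rated nothing else.
toy_mat = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'B': [4.0, 5.0, 2.0, 1.0, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan, 3.0],
}, index=pd.Index([1, 2, 3, 4, 5], name='user_id'))

# Correlate every column with movie A, using overlapping ratings only
similar_to_a = toy_mat.corrwith(toy_mat['A'])
print(similar_to_a)
```

Here `A` correlates perfectly with itself, `B` has a correlation of 0.8 with `A`, and `C` comes out as NaN because no user rated both `A` and `C`. Columns with too few overlapping ratings are also the likely source of the RuntimeWarning shown above.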

Clean the data by removing NaN values and converting the Series to a DataFrame


In [21]:
corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()


Out[21]:
Correlation
title
'Til There Was You (1997) 0.872872
1-900 (1994) -0.645497
101 Dalmatians (1996) 0.211132
12 Angry Men (1957) 0.184289
187 (1997) 0.027398

In [22]:
corr_starwars.sort_values('Correlation',ascending=False).head(10)


Out[22]:
Correlation
title
Hollow Reed (1996) 1.0
Stripes (1981) 1.0
Star Wars (1977) 1.0
Man of the Year (1995) 1.0
Beans of Egypt, Maine, The (1994) 1.0
Safe Passage (1994) 1.0
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) 1.0
Outlaw, The (1943) 1.0
Line King: Al Hirschfeld, The (1996) 1.0
Hurricane Streets (1998) 1.0
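The perfect correlations in the table above are most likely an artifact of tiny overlaps: when only two users rated both movies, the Pearson correlation is always exactly +1 or -1 (as long as the ratings differ). A minimal sketch with made-up ratings (`target`, `obscure`, and `overlap` are assumed names):

```python
import numpy as np
import pandas as pd

# Made-up ratings from five users for a popular movie
target = pd.Series([5.0, 3.0, 4.0, 2.0, 1.0])

# An obscure movie rated by only the first two of those users
obscure = pd.Series([4.0, 2.0, np.nan, np.nan, np.nan])

# Series.corr uses only the positions where both values are present
corr = target.corr(obscure)
overlap = (target.notnull() & obscure.notnull()).sum()
print(corr, overlap)
```

Two points always lie on a straight line, so the correlation says nothing about real similarity here. This is why a minimum number of ratings is a reasonable filter.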

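Filter out movies with 100 or fewer ratings (a threshold chosen from the histogram of the number of ratings). Movies rated by only a handful of users can reach a perfect correlation by chance, as in the table above, so this filter gives more reliable results.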


In [23]:
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.head()


Out[23]:
Correlation num of ratings
title
'Til There Was You (1997) 0.872872 9
1-900 (1994) -0.645497 5
101 Dalmatians (1996) 0.211132 109
12 Angry Men (1957) 0.184289 125
187 (1997) 0.027398 41

Now keep only the movies with more than 100 ratings and sort them by correlation


In [24]:
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()


Out[24]:
Correlation num of ratings
title
Star Wars (1977) 1.000000 584
Empire Strikes Back, The (1980) 0.748353 368
Return of the Jedi (1983) 0.672556 507
Raiders of the Lost Ark (1981) 0.536117 420
Austin Powers: International Man of Mystery (1997) 0.377433 130

The same for the comedy Liar Liar:


In [25]:
corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()


Out[25]:
Correlation num of ratings
title
Liar Liar (1997) 1.000000 485
Batman Forever (1995) 0.516968 114
Mask, The (1994) 0.484650 129
Down Periscope (1996) 0.472681 101
Con Air (1997) 0.469828 137
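The steps repeated for Star Wars and Liar Liar could be wrapped in one helper. The function below is a sketch, not part of the original notebook: the name `recommend_similar` and the parameters `min_ratings` and `top_n` are assumptions, and unlike the cells above it drops the query movie itself from the result.

```python
import pandas as pd

def recommend_similar(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Return the top_n movies most correlated with `title` (hypothetical helper)."""
    # Correlate every movie column with the chosen movie, dropping NaN results
    corr = moviemat.corrwith(moviemat[title]).dropna().to_frame('Correlation')
    # Attach the number of ratings so rarely rated movies can be filtered out
    corr = corr.join(ratings['num of ratings'])
    popular = corr[corr['num of ratings'] > min_ratings]
    # Remove the query movie, which always correlates perfectly with itself
    popular = popular.drop(title, errors='ignore')
    return popular.sort_values('Correlation', ascending=False).head(top_n)

# Toy data, invented for illustration: movie C has only two ratings
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    'title': ['A'] * 4 + ['B'] * 4 + ['C'] * 2,
    'rating': [5, 4, 1, 2, 4, 5, 2, 1, 1, 5],
})
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')
toy_ratings = toy.groupby('title')['rating'].agg(['mean', 'count'])
toy_ratings.columns = ['rating', 'num of ratings']

result = recommend_similar(toy_mat, toy_ratings, 'A', min_ratings=3)
print(result)
```

With the notebook's own data, a call such as `recommend_similar(moviemat, ratings, 'Liar Liar (1997)')` would reproduce the filtering and sorting shown above, minus the row for the movie itself.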